Evaluation of decision forests on text categorization

Authors

  • Hao Chen
  • Tin Kam Ho
Abstract

Text categorization is useful for indexing documents for information retrieval, filtering parts for document understanding, and summarizing the contents of documents of special interest. We describe a text categorization task and an experiment using documents from the Reuters and OHSUMED collections. We applied the Decision Forest classifier and compared its accuracy to that of the C4.5 and kNN classifiers, using both category-dependent and category-independent term selection schemes. Decision Forest outperforms both C4.5 and kNN in all cases, and category-dependent term selection yields better accuracy. The performance of all three classifiers degrades from the Reuters collection to the OHSUMED collection, but Decision Forest remains superior.
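
The difference between the two term selection schemes named in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how terms were scored, so everything here is an assumption made for illustration: the toy corpus, the function names, the use of document frequency for the category-independent scheme, and the in-category minus out-of-category frequency score for the category-dependent scheme.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def select_terms_independent(docs, k):
    """Category-independent: one shared list of k terms for all categories.

    Assumption: terms are ranked by overall document frequency
    (ties broken alphabetically).
    """
    df = document_frequency(docs)
    ranked = sorted(df, key=lambda t: (-df[t], t))
    return ranked[:k]


def select_terms_dependent(docs, labels, k):
    """Category-dependent: a separate list of k terms for each category.

    Assumption: a term's score for a category is the fraction of
    in-category documents containing it minus the fraction of
    out-of-category documents containing it.
    """
    selected = {}
    for category in sorted(set(labels)):
        inside = [d for d, y in zip(docs, labels) if y == category]
        outside = [d for d, y in zip(docs, labels) if y != category]
        df_in = document_frequency(inside)
        df_out = document_frequency(outside)

        def score(term):
            p_in = df_in[term] / len(inside)
            p_out = df_out[term] / len(outside) if outside else 0.0
            return p_in - p_out

        vocabulary = set(df_in) | set(df_out)
        ranked = sorted(vocabulary, key=lambda t: (-score(t), t))
        selected[category] = ranked[:k]
    return selected


# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["wheat", "price", "rise"],
    ["wheat", "export", "price"],
    ["oil", "price", "fall"],
    ["oil", "export", "price"],
]
labels = ["grain", "grain", "crude", "crude"]

print(select_terms_independent(docs, 2))
print(select_terms_dependent(docs, labels, 2))
```

On this toy corpus the shared list is dominated by "price", which occurs in every document and so separates no category, while the per-category lists pick terms such as "wheat" and "oil". This is only a plausible intuition for why a category-dependent scheme might help, not a reproduction of the paper's method.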

Similar resources

E-mail Classification by Decision Forests

We investigate the use of decision forests for automated e-mail filing into folders and junk e-mail filtering. The experiments show that decision forests offer the following advantages: (i) ability to deal with the large dimensionality of feature vectors in text categorization, (ii) improved accuracy of the ensemble over the single decision trees and favourable comparison with a number of other...

Image Categorization Using Scene-Context Scale Based on Random Forests

Scene-context plays an important role in scene analysis and object recognition. Among various sources of scene-context, we focus on scene-context scale, which means the effective scale of local context to classify an image pixel in a scene. This paper presents random forests based image categorization using the scene-context scale. The proposed method uses random forests, which are ensembles of...

An Improved Random Forest Classifier for Text Categorization

This paper proposes an improved random forest algorithm for classifying text data. This algorithm is particularly designed for analyzing very high dimensional data with multiple classes whose well-known representative data is text corpus. A novel feature weighting method and tree selection method are developed and synergistically served for making random forest framework well suited to categori...

Evaluation of Stemming and Stop Word Techniques on Text Classification Problem

Nowadays a huge amount of information is available over the internet in electronic format. This large amount of data can be analyzed to maximize the benefits for intelligent decision making. Text categorization is an important and extensively studied problem in machine learning. The basic phases in text categorization include preprocessing features, extracting relevant features against the f...

New stemming for Arabic text classification using feature selection and decision trees

In this paper we conduct a comparative study between two stemming algorithms, the Khoja stemmer and our new stemmer, for Arabic text classification (categorization), using chi-square statistics for feature selection and focusing on the decision tree classifier. Evaluation used a corpus that consists of 5070 documents independently classified into six categories: sport, entertainment, business, middle east...

Publication date: 2000